This code covers chapter 2 of “Introduction to Data Mining” by Pang-Ning Tan, Michael Steinbach and Vipin Kumar. See table of contents for code examples for other chapters.

This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.

Tidyverse

Some of the code uses tidyverse tibbles to replace data.frames, the pipe operator %>% to chain functions, and data transformation functions like filter(), arrange(), select(), and mutate() provided by the tidyverse package dplyr. A good overview is given in the RStudio Data Transformation Cheat Sheet, and an introduction can be found in the Data Wrangling section of the free book R for Data Science.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(GGally) # for ggpairs
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Data Preparation: The Iris Dataset

We will use a toy dataset that comes with R. Fisher’s iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. For more details see: ? iris

Load the iris data set and convert the data.frame into a tibble. Note: datasets that come with R or R packages can be loaded with data().

data(iris)
iris <- as_tibble(iris)
iris
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

Data Quality

Inspect the data (produce a scatterplot matrix using ggpairs from package GGally). You may be able to spot noise and outliers.

ggpairs(iris, aes(color = Species))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Get summary statistics for each column (to check for outliers and missing values).

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Get just the mean for each numeric column:

iris %>% summarize_if(is.numeric, mean)
## # A tibble: 1 x 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1         5.84        3.06         3.76        1.20

Often you will do something like:

clean.data <- iris %>% drop_na() %>% unique()
summary(clean.data)
##   Sepal.Length    Sepal.Width    Petal.Length    Petal.Width          Species  
##  Min.   :4.300   Min.   :2.00   Min.   :1.000   Min.   :0.100   setosa    :50  
##  1st Qu.:5.100   1st Qu.:2.80   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
##  Median :5.800   Median :3.00   Median :4.300   Median :1.300   virginica :49  
##  Mean   :5.844   Mean   :3.06   Mean   :3.749   Mean   :1.195                  
##  3rd Qu.:6.400   3rd Qu.:3.30   3rd Qu.:5.100   3rd Qu.:1.800                  
##  Max.   :7.900   Max.   :4.40   Max.   :6.900   Max.   :2.500

Note that one duplicate case is gone (virginica now has only 49 cases). All cases with missing values would also have been dropped.
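As a quick check (not part of the original code), base R's duplicated() can locate the case that unique() removed; a local copy of the dataset is used so the tibble above is left untouched:

```r
# Mark rows that are exact duplicates of an earlier row.
iris_df <- datasets::iris
dup <- duplicated(iris_df)
sum(dup)        # → 1 (a single duplicated case)
iris_df[dup, ]  # the duplicate is a virginica flower
```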

Aggregation

Aggregate by species. First group the data and then summarize each group.

iris %>% group_by(Species) %>% summarize_all(mean)
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
## * <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33 
## 3 virginica          6.59        2.97         5.55       2.03
iris %>% group_by(Species) %>% summarize_all(median)
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
## * <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa              5           3.4         1.5          0.2
## 2 versicolor          5.9         2.8         4.35         1.3
## 3 virginica           6.5         3           5.55         2

Sampling

Random sampling

Sample from a vector with replacement.

sample(c("A", "B", "C"), size = 10, replace = TRUE)
##  [1] "B" "B" "B" "C" "B" "C" "A" "B" "C" "A"

Sampling rows from a tibble (I set the random number generator seed to make the results reproducible).

set.seed(1000)

s <- iris %>% sample_n(15)
ggpairs(s, aes(color = Species))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Stratified sampling

You need to install the package sampling with: install.packages("sampling")

library(sampling)
id2 <- strata(iris, stratanames="Species", size=c(5,5,5), method="srswor")
id2
##        Species ID_unit Prob Stratum
## 7       setosa       7  0.1       1
## 9       setosa       9  0.1       1
## 10      setosa      10  0.1       1
## 24      setosa      24  0.1       1
## 48      setosa      48  0.1       1
## 58  versicolor      58  0.1       2
## 62  versicolor      62  0.1       2
## 74  versicolor      74  0.1       2
## 78  versicolor      78  0.1       2
## 99  versicolor      99  0.1       2
## 106  virginica     106  0.1       3
## 107  virginica     107  0.1       3
## 127  virginica     127  0.1       3
## 135  virginica     135  0.1       3
## 145  virginica     145  0.1       3
s2 <- iris %>% slice(id2$ID_unit)
ggpairs(s2, aes(color = Species))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Features

Dimensionality reduction

Principal Components Analysis (PCA) represents high-dimensional data in fewer dimensions. Points that are close together in the high-dimensional space tend to also be close together in the lower-dimensional space. We often use two dimensions to visualize high-dimensional data as a scatter plot.

Look at three dimensions of the data using an interactive 3d plot (needs package plotly); the fourth dimension, Petal.Width, is mapped to point size.

#library(plotly) # I don't load the package because its namespace clashes with select in dplyr.
plotly::plot_ly(iris, x = ~Sepal.Length, y = ~Petal.Length, z = ~Sepal.Width,
  size = ~Petal.Width, color = ~Species, type="scatter3d",
 mode="markers")
## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

Calculate the principal components

pc <- iris %>% select(-Species) %>% as.matrix() %>% prcomp()

How important is each principal component?

plot(pc)
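The proportion of variance explained by each component can also be computed from the standard deviations (a small check; pc is recomputed here so the chunk is self-contained). summary(pc) reports the same numbers as "Proportion of Variance".

```r
# Proportion of total variance captured by each principal component.
pc <- prcomp(as.matrix(datasets::iris[, 1:4]))
prop_var <- pc$sdev^2 / sum(pc$sdev^2)
round(prop_var, 3)  # PC1 alone explains over 90% of the variance
```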

Inspect the raw object (display structure)

str(pc)
## List of 5
##  $ sdev    : num [1:4] 2.056 0.493 0.28 0.154
##  $ rotation: num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  $ center  : Named num [1:4] 5.84 3.06 3.76 1.2
##   ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  $ scale   : logi FALSE
##  $ x       : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  - attr(*, "class")= chr "prcomp"
ggplot(as_tibble(pc$x), aes(x = PC1, y = PC2, color = iris$Species)) + geom_point()

Plot the projected data and add the original dimensions as arrows (this can be done with ggplot2, but is currently painful; see https://stackoverflow.com/questions/6578355/plotting-pca-biplot-with-ggplot2).

biplot(pc, col = c("grey", "red"))

Another popular method to project data in lower dimensions for visualization is t-distributed stochastic neighbor embedding (t-SNE) available in package Rtsne.
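A minimal sketch of how Rtsne could be applied to iris (this assumes the package is installed; the call is guarded so the code is skipped otherwise). Rtsne refuses duplicate rows by default, so they are removed first:

```r
# Embed the numeric iris columns in 2 dimensions with t-SNE.
if (requireNamespace("Rtsne", quietly = TRUE)) {
  iris_unique <- unique(datasets::iris)   # Rtsne errors on duplicate rows
  tsne <- Rtsne::Rtsne(as.matrix(iris_unique[, 1:4]))
  plot(tsne$Y, col = iris_unique$Species, pch = 19,
       xlab = "dimension 1", ylab = "dimension 2")
}
```

Note that unlike PCA, the embedding is stochastic; set a seed for reproducible output.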

Multi-dimensional scaling (MDS) does the reverse process. It takes distances and produces a space where points are placed to represent these distances as well as possible. Let’s calculate distances in the 4-d space of iris.

d <- iris %>% select(-Species) %>% dist()

and do metric (classical) MDS to reconstruct a 2-d space.

fit <- cmdscale(d, k = 2)
colnames(fit) <- c("comp1", "comp2")
fit <- as_tibble(fit)

ggplot(fit, aes(x = comp1, y = comp2, color = iris$Species)) + geom_point()

Non-metric (non-parametric) scaling is available in package MASS as functions isoMDS and sammon.
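As a sketch (MASS ships with the standard R distribution), Kruskal's non-metric MDS can be applied to the same kind of distances. isoMDS fails on zero distances, so rows with duplicated numeric values are dropped first:

```r
# Non-metric MDS on the iris distances (MASS::isoMDS).
keep <- !duplicated(datasets::iris[, 1:4])  # zero distances are not allowed
iris_u <- datasets::iris[keep, ]
d <- dist(iris_u[, 1:4])
fit_nmds <- MASS::isoMDS(d, k = 2, trace = FALSE)
plot(fit_nmds$points, col = iris_u$Species, pch = 19,
     xlab = "comp1", ylab = "comp2")
```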

Feature selection

We will talk about feature selection when we discuss classification models.

Discretize features

First, look at the raw values of Petal.Width plotted against the row index:

ggplot(iris, aes(x = Petal.Width, y = 1:150)) + geom_point()

A histogram is a better visualization for the distribution of a single variable.

ggplot(iris, aes(Petal.Width)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Equal interval width

iris %>% pull(Sepal.Width) %>% cut(breaks=3)
##   [1] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (3.6,4.4] (2.8,3.6]
##   [8] (2.8,3.6] (2.8,3.6] (2.8,3.6] (3.6,4.4] (2.8,3.6] (2.8,3.6] (2.8,3.6]
##  [15] (3.6,4.4] (3.6,4.4] (3.6,4.4] (2.8,3.6] (3.6,4.4] (3.6,4.4] (2.8,3.6]
##  [22] (3.6,4.4] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
##  [29] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (3.6,4.4] (3.6,4.4] (2.8,3.6]
##  [36] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8]  
##  [43] (2.8,3.6] (2.8,3.6] (3.6,4.4] (2.8,3.6] (3.6,4.4] (2.8,3.6] (3.6,4.4]
##  [50] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]  
##  [57] (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]   (2,2.8]   (2.8,3.6] (2,2.8]  
##  [64] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]  
##  [71] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]   (2.8,3.6] (2.8,3.6] (2,2.8]  
##  [78] (2.8,3.6] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]   (2,2.8]   (2,2.8]  
##  [85] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]   (2,2.8]  
##  [92] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]   (2.8,3.6] (2.8,3.6] (2.8,3.6]
##  [99] (2,2.8]   (2,2.8]   (2.8,3.6] (2,2.8]   (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [106] (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]   (2.8,3.6] (2.8,3.6] (2,2.8]  
## [113] (2.8,3.6] (2,2.8]   (2,2.8]   (2.8,3.6] (2.8,3.6] (3.6,4.4] (2,2.8]  
## [120] (2,2.8]   (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]   (2.8,3.6] (2.8,3.6]
## [127] (2,2.8]   (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]   (3.6,4.4] (2,2.8]  
## [134] (2,2.8]   (2,2.8]   (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [141] (2.8,3.6] (2.8,3.6] (2,2.8]   (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8]  
## [148] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## Levels: (2,2.8] (2.8,3.6] (3.6,4.4]

Other discretization methods (equal frequency, k-means clustering, etc.) are available in package arules.

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
iris %>% pull(Petal.Width) %>% discretize(method = "interval", breaks = 3)
##   [1] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##   [8] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [15] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [22] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [29] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [36] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [43] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [50] [0.1,0.9) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [57] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [64] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [71] [1.7,2.5] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [78] [1.7,2.5] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [85] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [92] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [99] [0.9,1.7) [0.9,1.7) [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [106] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [113] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [120] [0.9,1.7) [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [127] [1.7,2.5] [1.7,2.5] [1.7,2.5] [0.9,1.7) [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [134] [0.9,1.7) [0.9,1.7) [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [141] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [148] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## attr(,"discretized:breaks")
## [1] 0.1 0.9 1.7 2.5
## attr(,"discretized:method")
## [1] interval
## Levels: [0.1,0.9) [0.9,1.7) [1.7,2.5]
iris %>% pull(Petal.Width) %>% discretize(method = "frequency", breaks = 3)
##   [1] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##   [7] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [13] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [19] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [25] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [31] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [37] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [43] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [49] [0.1,0.867) [0.1,0.867) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6)
##  [55] [0.867,1.6) [0.867,1.6) [1.6,2.5]   [0.867,1.6) [0.867,1.6) [0.867,1.6)
##  [61] [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6)
##  [67] [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [1.6,2.5]   [0.867,1.6)
##  [73] [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [1.6,2.5]  
##  [79] [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [1.6,2.5]  
##  [85] [0.867,1.6) [1.6,2.5]   [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6)
##  [91] [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6)
##  [97] [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [1.6,2.5]   [1.6,2.5]  
## [103] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## [109] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## [115] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [0.867,1.6)
## [121] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## [127] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## [133] [1.6,2.5]   [0.867,1.6) [0.867,1.6) [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## [139] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## [145] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## attr(,"discretized:breaks")
## [1] 0.1000000 0.8666667 1.6000000 2.5000000
## attr(,"discretized:method")
## [1] frequency
## Levels: [0.1,0.867) [0.867,1.6) [1.6,2.5]
iris %>% pull(Petal.Width) %>% discretize(method = "cluster", breaks = 3)
##   [1] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##   [6] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [11] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [16] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [21] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [26] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [31] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [36] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [41] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [46] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [51] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [56] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [61] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [66] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [71] [1.71,2.5]   [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [76] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [81] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [86] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [91] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [96] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
## [101] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## [106] [1.71,2.5]   [0.792,1.71) [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## [111] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## [116] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [0.792,1.71)
## [121] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## [126] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [0.792,1.71)
## [131] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [0.792,1.71) [0.792,1.71)
## [136] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## [141] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## [146] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## attr(,"discretized:breaks")
## [1] 0.1000000 0.7915185 1.7054750 2.5000000
## attr(,"discretized:method")
## [1] cluster
## Levels: [0.1,0.792) [0.792,1.71) [1.71,2.5]
ggplot(iris, aes(Petal.Width)) + geom_histogram() +
  geom_vline(xintercept =
      iris %>% pull(Petal.Width) %>% discretize(method = "interval", breaks = 3, onlycuts = TRUE),
    color = "blue") +
  labs(title = "Discretization: interval", subtitle = "Blue lines are boundaries")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(iris, aes(Petal.Width)) + geom_histogram() +
  geom_vline(xintercept =
      iris %>% pull(Petal.Width) %>% discretize(method = "frequency", breaks = 3, onlycuts = TRUE),
    color = "blue") +
  labs(title = "Discretization: frequency", subtitle = "Blue lines are boundaries")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(iris, aes(Petal.Width)) + geom_histogram() +
  geom_vline(xintercept =
      iris %>% pull(Petal.Width) %>% discretize(method = "cluster", breaks = 3, onlycuts = TRUE),
    color = "blue") +
  labs(title = "Discretization: cluster", subtitle = "Blue lines are boundaries")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Standardize data (Z-score)

Standardize the scale of features to make them comparable: each column is centered by subtracting its mean and then scaled by dividing by its standard deviation. Afterwards, most values should fall in the interval [-3, 3].

Note: the tidyverse currently does not have a simple scale function, so I create a wrapper around the standard scale function in R:

scale_numeric <- function(x) x %>% mutate_if(is.numeric, function(y) as.vector(scale(y)))

iris.scaled <- iris %>% scale_numeric()
iris.scaled
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1       -0.898      1.02          -1.34       -1.31 setosa 
##  2       -1.14      -0.132         -1.34       -1.31 setosa 
##  3       -1.38       0.327         -1.39       -1.31 setosa 
##  4       -1.50       0.0979        -1.28       -1.31 setosa 
##  5       -1.02       1.25          -1.34       -1.31 setosa 
##  6       -0.535      1.93          -1.17       -1.05 setosa 
##  7       -1.50       0.786         -1.34       -1.18 setosa 
##  8       -1.02       0.786         -1.28       -1.31 setosa 
##  9       -1.74      -0.361         -1.34       -1.31 setosa 
## 10       -1.14       0.0979        -1.28       -1.44 setosa 
## # … with 140 more rows
summary(iris.scaled)
##   Sepal.Length       Sepal.Width       Petal.Length      Petal.Width     
##  Min.   :-1.86378   Min.   :-2.4258   Min.   :-1.5623   Min.   :-1.4422  
##  1st Qu.:-0.89767   1st Qu.:-0.5904   1st Qu.:-1.2225   1st Qu.:-1.1799  
##  Median :-0.05233   Median :-0.1315   Median : 0.3354   Median : 0.1321  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.67225   3rd Qu.: 0.5567   3rd Qu.: 0.7602   3rd Qu.: 0.7880  
##  Max.   : 2.48370   Max.   : 3.0805   Max.   : 1.7799   Max.   : 1.7064  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Proximities: Similarities and distances

Note: R actually only uses dissimilarities/distances.
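A similarity s can always be converted into a dissimilarity, e.g., with d = 1 - s (a common convention, shown here on a correlation matrix treated as a similarity between features):

```r
# Turn a similarity matrix (correlations in [-1, 1]) into a dist object.
s <- cor(datasets::iris[, 1:4])  # similarity between the features
d <- as.dist(1 - s)              # dissimilarity: 0 = perfectly correlated
round(d, 2)
```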

Minkowski distances

iris_sample <- iris.scaled %>% select(-Species) %>% slice(1:5)
iris_sample
## # A tibble: 5 x 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1       -0.898      1.02          -1.34       -1.31
## 2       -1.14      -0.132         -1.34       -1.31
## 3       -1.38       0.327         -1.39       -1.31
## 4       -1.50       0.0979        -1.28       -1.31
## 5       -1.02       1.25          -1.34       -1.31

Calculate distance matrices between the first 5 flowers (using only the 4 numeric columns).

iris_sample %>% dist(method="euclidean")
##           1         2         3         4
## 2 1.1722914                              
## 3 0.8427840 0.5216255                    
## 4 1.0999999 0.4325508 0.2829432          
## 5 0.2592702 1.3818560 0.9882608 1.2459861
iris_sample %>% dist(method="manhattan")
##           1         2         3         4
## 2 1.3886674                              
## 3 1.2279853 0.7570306                    
## 4 1.5781768 0.6483657 0.4634868          
## 5 0.3501915 1.4973323 1.3366502 1.6868417
iris_sample %>% dist(method="maximum")
##           1         2         3         4
## 2 1.1471408                              
## 3 0.6882845 0.4588563                    
## 4 0.9177126 0.3622899 0.2294282          
## 5 0.2294282 1.3765690 0.9177126 1.1471408

Note: Don’t forget to scale the data if the ranges are very different!

Distances for binary data (Jaccard and Hamming)

b <- rbind(
  c(0,0,0,1,1,1,1,0,0,1),
  c(0,0,1,1,1,0,0,1,0,0)
  )
b
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    1    1    1    1    0    0     1
## [2,]    0    0    1    1    1    0    0    1    0     0

Jaccard index

The Jaccard index is a similarity measure, so R reports 1 - Jaccard as the dissimilarity.

b %>% dist(method = "binary")
##           1
## 2 0.7142857

Hamming distance

The Hamming distance is the number of mismatches between two binary vectors. It is equivalent to the Manhattan distance on 0-1 data and also to the squared Euclidean distance.

b %>% dist(method = "manhattan")
##   1
## 2 5
b %>% dist(method = "euclidean") %>% "^"(2)
##   1
## 2 5

Note: "^"(2) calculates the square.

Distances for mixed data

Gower’s distance

It works with mixed data and is provided by package proxy.

data <- tibble(
  height= c(      160,    185,    170),
  weight= c(       52,     90,     75),
  sex=    c( "female", "male", "male")
)
data
## # A tibble: 3 x 3
##   height weight sex   
##    <dbl>  <dbl> <chr> 
## 1    160     52 female
## 2    185     90 male  
## 3    170     75 male

Note: Nominal variables need to be factors!

data <- data %>% mutate_if(is.character, factor)
data
## # A tibble: 3 x 3
##   height weight sex   
##    <dbl>  <dbl> <fct> 
## 1    160     52 female
## 2    185     90 male  
## 3    170     75 male
library(proxy)
## 
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
d_Gower <- data %>% dist(method="Gower")
d_Gower
##           1         2
## 2 1.0000000          
## 3 0.6684211 0.3315789

Note: Gower’s distance automatically scales, so no need to scale the data first.

Using Euclidean distance with mixed data

Sometimes methods (e.g., k-means) can only use Euclidean distance. In this case, nominal features can be converted into 0-1 dummy variables. The Euclidean distance on these gives a usable distance measure.

Create dummy variables

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:sampling':
## 
##     cluster
## The following object is masked from 'package:purrr':
## 
##     lift
data_dummy <- dummyVars(~., data) %>% predict(data)
data_dummy
##   height weight sex.female sex.male
## 1    160     52          1        0
## 2    185     90          0        1
## 3    170     75          0        1

Since sex is now represented by two columns, we weight each of them by 1/2 after scaling.

weight <- matrix(c(1,1,1/2,1/2), ncol = 4, nrow = nrow(data_dummy), byrow = TRUE)
data_dummy_scaled <- scale(data_dummy) * weight

d_dummy <- data_dummy_scaled %>% dist()
d_dummy
##          1        2
## 2 3.064169         
## 3 1.890931 1.426621

The distances are (mostly) consistent with Gower's distance (apart from Gower's distance being scaled to the interval [0, 1]).

ggplot(tibble(d_dummy, d_Gower), aes(x = d_dummy, y = d_Gower)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## Don't know how to automatically pick scale for object of type dist. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type dist. Defaulting to continuous.
## `geom_smooth()` using formula 'y ~ x'

Additional proximity measures available in package proxy

library(proxy)
pr_DB$get_entries() %>% names()
##  [1] "Jaccard"         "Kulczynski1"     "Kulczynski2"     "Mountford"      
##  [5] "Fager"           "Russel"          "simple matching" "Hamman"         
##  [9] "Faith"           "Tanimoto"        "Dice"            "Phi"            
## [13] "Stiles"          "Michael"         "Mozley"          "Yule"           
## [17] "Yule2"           "Ochiai"          "Simpson"         "Braun-Blanquet" 
## [21] "cosine"          "eJaccard"        "eDice"           "correlation"    
## [25] "Chi-squared"     "Phi-squared"     "Tschuprow"       "Cramer"         
## [29] "Pearson"         "Gower"           "Euclidean"       "Mahalanobis"    
## [33] "Bhjattacharyya"  "Manhattan"       "supremum"        "Minkowski"      
## [37] "Canberra"        "Wave"            "divergence"      "Kullback"       
## [41] "Bray"            "Soergel"         "Levenshtein"     "Podani"         
## [45] "Chord"           "Geodesic"        "Whittaker"       "Hellinger"      
## [49] "fJaccard"

Relationship between features

Correlation (for ratio/interval scaled features)

Pearson correlation between features (columns)

cc <- iris %>% select(-Species) %>% cor()

ggplot(iris, aes(Petal.Length, Petal.Width)) + geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

with(iris, cor(Petal.Length, Petal.Width))
## [1] 0.9628654

Note: the expression using with is the same as cor(iris$Petal.Length, iris$Petal.Width)

with(iris, cor.test(Petal.Length, Petal.Width))
## 
##  Pearson's product-moment correlation
## 
## data:  Petal.Length and Petal.Width
## t = 43.387, df = 148, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9490525 0.9729853
## sample estimates:
##       cor 
## 0.9628654
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

with(iris, cor(Sepal.Length, Sepal.Width))
## [1] -0.1175698
with(iris, cor.test(Sepal.Length, Sepal.Width))
## 
##  Pearson's product-moment correlation
## 
## data:  Sepal.Length and Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27269325  0.04351158
## sample estimates:
##        cor 
## -0.1175698

Rank correlation (for ordinal features)

Convert the numeric variables with cut (see ? cut) into ordered factors with three levels.

iris_ord <- iris %>% mutate_if(is.numeric,
  function(x) cut(x, 3, labels = c("short", "medium", "long"), ordered = TRUE))

iris_ord
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##    <ord>        <ord>       <ord>        <ord>       <fct>  
##  1 short        medium      short        short       setosa 
##  2 short        medium      short        short       setosa 
##  3 short        medium      short        short       setosa 
##  4 short        medium      short        short       setosa 
##  5 short        medium      short        short       setosa 
##  6 short        long        short        short       setosa 
##  7 short        medium      short        short       setosa 
##  8 short        medium      short        short       setosa 
##  9 short        medium      short        short       setosa 
## 10 short        medium      short        short       setosa 
## # … with 140 more rows
summary(iris_ord)
##  Sepal.Length Sepal.Width Petal.Length Petal.Width       Species  
##  short :59    short :47   short :50    short :50   setosa    :50  
##  medium:71    medium:88   medium:54    medium:54   versicolor:50  
##  long  :20    long  :15   long  :46    long  :46   virginica :50
iris_ord %>% pull(Sepal.Length)
##   [1] short  short  short  short  short  short  short  short  short  short 
##  [11] short  short  short  short  medium medium short  short  medium short 
##  [21] short  short  short  short  short  short  short  short  short  short 
##  [31] short  short  short  short  short  short  short  short  short  short 
##  [41] short  short  short  short  short  short  short  short  short  short 
##  [51] long   medium long   short  medium medium medium short  medium short 
##  [61] short  medium medium medium medium medium medium medium medium medium
##  [71] medium medium medium medium medium medium long   medium medium medium
##  [81] short  short  medium medium short  medium medium medium medium short 
##  [91] short  medium medium short  medium medium medium medium short  medium
## [101] medium medium long   medium medium long   short  long   medium long  
## [111] medium medium long   medium medium medium medium long   long   medium
## [121] long   medium long   medium medium long   medium medium medium long  
## [131] long   long   medium medium medium long   medium medium medium long  
## [141] medium long   medium long   medium medium medium medium medium medium
## Levels: short < medium < long

Kendall’s tau rank correlation coefficient

iris_ord %>% select(-Species) %>% sapply(xtfrm) %>% cor(method = "kendall")
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1437985    0.7418595   0.7295139
## Sepal.Width    -0.1437985   1.0000000   -0.3298796  -0.3154474
## Petal.Length    0.7418595  -0.3298796    1.0000000   0.9198290
## Petal.Width     0.7295139  -0.3154474    0.9198290   1.0000000

Spearman’s rho

iris_ord %>% select(-Species) %>% sapply(xtfrm) %>% cor(method = "spearman")
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1569659    0.7937613   0.7843406
## Sepal.Width    -0.1569659   1.0000000   -0.3662775  -0.3517262
## Petal.Length    0.7937613  -0.3662775    1.0000000   0.9399038
## Petal.Width     0.7843406  -0.3517262    0.9399038   1.0000000

Note: cor() does not accept factors, so unfortunately we first have to transform the ordered factors into numbers representing the order using xtfrm().
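To see what xtfrm() does, here is a small sketch (not part of the original code): applied to an ordered factor, it returns the integer level codes, which preserve the level order.

```r
# xtfrm() maps an ordered factor to integers that respect the level order
x <- factor(c("short", "long", "medium"),
  levels = c("short", "medium", "long"), ordered = TRUE)
xtfrm(x)
## [1] 1 3 2
```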

Compare to the Pearson correlation on the original data

iris %>% select(-Species) %>% cor()
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

Relationship between nominal and ordinal features

Are sepal length and species related? Use cross-tabulation

tbl <- iris_ord %>% select(Sepal.Length, Species) %>% table()
tbl
##             Species
## Sepal.Length setosa versicolor virginica
##       short      47         11         1
##       medium      3         36        32
##       long        0          3        17
# doing this is a little more involved using tidyverse
iris_ord %>%
  select(Species, Sepal.Length) %>%
  pivot_longer(cols = Sepal.Length) %>%
  group_by(Species, value) %>% count() %>% ungroup() %>%
  pivot_wider(names_from = Species, values_from = n)
## # A tibble: 3 x 4
##   value  setosa versicolor virginica
##   <ord>   <int>      <int>     <int>
## 1 short      47         11         1
## 2 medium      3         36        32
## 3 long       NA          3        17

Test of Independence: Pearson’s chi-squared test is performed with the null hypothesis that the joint distribution of the cell counts in a 2-dimensional contingency table is the product of the row and column marginals, i.e., the null hypothesis H0 is that the two variables are independent.

tbl %>% chisq.test()
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 111.63, df = 4, p-value < 2.2e-16

Using xtabs instead

x <- xtabs(~Sepal.Length + Species, data = iris_ord)
x
##             Species
## Sepal.Length setosa versicolor virginica
##       short      47         11         1
##       medium      3         36        32
##       long        0          3        17
summary(x)
## Call: xtabs(formula = ~Sepal.Length + Species, data = iris_ord)
## Number of cases in table: 150 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 111.63, df = 4, p-value = 3.262e-23
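As a side note (a sketch, not in the original code), the expected cell counts under independence that the test statistic compares against can be inspected from the chisq.test() result:

```r
# expected counts under H0 (independence):
# row total * column total / grand total
chisq.test(tbl)$expected
```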

Group-wise averages

iris %>% group_by(Species) %>% summarize_at(vars(Sepal.Length), mean)
## # A tibble: 3 x 2
##   Species    Sepal.Length
## * <fct>             <dbl>
## 1 setosa             5.01
## 2 versicolor         5.94
## 3 virginica          6.59
iris %>% group_by(Species) %>% summarize_all(mean)
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
## * <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33 
## 3 virginica          6.59        2.97         5.55       2.03
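Note that summarize_at() and summarize_all() have been superseded by across() in newer dplyr versions; an equivalent sketch (assuming dplyr >= 1.0 is available):

```r
# mean of every numeric column, grouped by species
iris %>%
  group_by(Species) %>%
  summarize(across(where(is.numeric), mean))
```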

Density estimation

Just plotting the data (here against the row index) is not very helpful

ggplot(iris, aes(Petal.Length, 1:150)) + geom_point()

Histograms work better

ggplot(iris, aes(Petal.Length)) +
  geom_histogram() +
  geom_rug(alpha = 1/10)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
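As the message suggests, the default of 30 bins is arbitrary; a sketch with an explicitly chosen binwidth (0.2 cm is an assumption, not a value from the original):

```r
library(ggplot2)

# choose the binwidth explicitly instead of relying on the default 30 bins
ggplot(iris, aes(Petal.Length)) +
  geom_histogram(binwidth = 0.2) +
  geom_rug(alpha = 1/10)
```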

Kernel density estimate (KDE)

ggplot(iris, aes(Petal.Length)) +
  geom_rug(alpha = 1/10) +
  geom_density()
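The smoothness of the KDE is controlled by the kernel bandwidth; a sketch with a manually chosen bandwidth (bw = 0.2 is an assumption chosen for illustration):

```r
library(ggplot2)

# a smaller bandwidth produces a less smooth, more detailed estimate
ggplot(iris, aes(Petal.Length)) +
  geom_rug(alpha = 1/10) +
  geom_density(bw = 0.2)
```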

Plot a 2D kernel density estimate

ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_jitter() +
  geom_density2d()

ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_bin2d(bins = 10) +
  geom_jitter(color = "red")

ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_hex(bins = 10) +
  geom_jitter(color = "red")